Yifan Mai

ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Jan 22, 2026

The Singapore Consensus on Global AI Safety Research Priorities

Jun 25, 2025

Judging LLMs on a Simplex

May 28, 2025

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

May 26, 2025

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Feb 26, 2025

SEA-HELM: Southeast Asian Holistic Evaluation of Language Models

Feb 20, 2025

Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

Oct 29, 2024

Language model developers should report train-test overlap

Oct 10, 2024

VHELM: A Holistic Evaluation of Vision Language Models

Oct 09, 2024

Introducing v0.5 of the AI Safety Benchmark from MLCommons

Apr 18, 2024